Abstract

The Write-All problem (using P processors, write 1's into all locations of an array of
size N, where $P \le N$) has been used as the basic building block for constructing efficient
and fault-tolerant parallel algorithms. All previous Write-All solutions use $\Omega(P)$ auxiliary
shared memory and assume that this memory is cleared or initialized to some known value.
When Write-All building blocks are used in polylogarithmic parallel time algorithms (e.g.,
to compute prefix sums or list ranking), auxiliary memory initialization cannot be amortized
over the computation. Thus, assuming clear memory is a very strong precondition, and for
Write-All itself it raises a legitimate “chicken-or-egg” objection.

In  this  note,  using  a  deterministic  bootstrapping  and  balancing  argument,  we  show 
how to Write-All when auxiliary memory is contaminated with arbitrary values. For any
dynamic pattern of fail-stop, no-restart errors on a CRCW PRAM with at least one surviving
processor, our new algorithm writes all 1's using $O(N + P \log^3 N/(\log\log N)^2)$ work, without
any initialization assumption. This technique can be combined with any Write-All algorithm
to  yield  efficient  simulations  of  any  PRAM  and  even  optimal  simulations  given  processor 
slack.  It  can  also  be  used  with  restartable  fail-stop  processor  simulations.  In  addition, 
we  show  that  for  the  parallel  prefix  computation  it  is  possible  to  improve  on  the  best 
deterministic  simulations  to  date:  by  a  factor  of  log  N  when  the  memory  is  clear  and  by  a 
factor  of  log  log  N  when  the  memory  is  contaminated. 
aas@cs.brown.edu, and Digital Equipment Corporation, LKGt-t/Tt, 550 King Street, Littleton, MA 01460,
USA. This research was supported by Digital Equipment Corp. and ONR grant N00014-91-J-1613.
1 Introduction

Related  work  and  motivation: 

The  study  of  fault-tolerant  and  asynchronous  parallel  algorithms  for  the  parallel  random  access 
machine  (pram  [8])  has  attracted  a  fair  amount  of  recent  attention.  Several  efficient  algorithms 
have  been  designed  for  prams  that  are  subject  to  stop-failures  or  to  processor  delays,  where 
this processor behavior is determined by adversaries of varying strengths. For example:
asynchronous prams are the subject of [1, 4, 5, 6, 9, 13, 18, 19], and fault-prone prams are studied in
[4,  11,  12,  13,  20].  The  motivation  of  this  research  area  is  to  bridge  the  gap  between  realizable 
parallel  computers  and  the  pram,  with  its  unrealistic  features  of  broad  bandwidth  memory 
access,  processor  synchrony  and  freedom  from  faults.  Our  work  is  in  the  area  of  asynchronous 
and  fault-prone  models,  but  we  do  use  broad  bandwidth  access  to  shared  memory  as  a  means  of 
providing  redundancy  when  encountering  faults.  For  a  detailed  discussion  of  the  general  model 
used  and  how  it  can  be  realized  see  [4]. 

Here, we reexamine the key problem of Write-All and remove a strong initialization
assumption that has been used in all its previous solutions. Write-All was formulated in [11]
in order to show that it is possible to combine efficiency and fault tolerance in the presence
of  arbitrary  dynamic  fail-stop  pram  processor  errors.  Its  solutions  have  been  used  to  compile 
pram  algorithms  for  architectures  where  asynchrony  or  processor  failures  are  present.  It  can 
be  formulated  as  follows: 

Using P processors, write 1's into all locations of an array of size N, where $P \le N$.

Write-All  captures  the  computational  progress  that  can  be  naturally  accomplished  in  unit 
time  by  a  pram  (when  P  =  N).  In  the  presence  of  asynchrony  or  failures,  efficient  solutions 
to  Write-All  (increasing  the  fault-free  work  by  polylogarithmic  factors  only)  are  non-obvious. 
Note that in all existing solutions the initial state of the size N array does not matter.
For example, we assume it is all 0's in [11, 4, 20], but the algorithms would work even if the
N locations were initialized with arbitrary 0's and 1's. A much more important assumption
in all previous Write-All solutions concerns the initial state of the additional auxiliary memory used
(typically of $\Omega(P)$ size). The basic assumption has been that:

The $\Omega(P)$ auxiliary shared memory is cleared or initialized to some known value.

In theory, this is a natural, even if unstated, assumption for prams [8] and RAMs (cf. Turing
Machine auxiliary tapes, which are initially blank). However, given the definition of Write-All, this
dependence on clear space raises a legitimate “chicken-or-egg” objection. In practice, memory
locations typically contain unpredictable values, and processes that need to use large blocks of
memory cannot assume that it is cleared or is initialized to a known value. In fact, operating
systems usually provide explicit services that allocate clear memory, e.g., calloc() in standard
C  libraries.  Such  allocation  is  predictably  much  more  time  consuming,  even  in  the  absence  of 
failures. 

It is easy to construct simple Write-All algorithms that do not assume clear shared memory,
but  they  appear  to  use  quadratic  work.  If  the  overall  computation  involves  many  steps,  one  can 
perhaps  afford  an  expensive  initialization  phase  and  amortize  its  cost  over  subsequent  efficient 
steps. Unfortunately, when Write-All building blocks are used in very fast (i.e., polylogarithmic
parallel time) algorithms (e.g., to compute prefix sums or list ranking), auxiliary memory
initialization  cannot  be  amortized  over  the  computation.  Fortunately,  we  show  that  there  is  a 
way  around  this  dilemma: 

We present Write-All algorithms and algorithm simulations that do not require that
the auxiliary memory is cleared prior to the computation.

Algorithms  in  the  setting  studied  in  the  present  paper  have  some  similarities  with  the 
notion of a self-stabilizing system introduced by Dijkstra in [7]. Paraphrasing [7], a system is
self-stabilizing  if  and  only  if,  regardless  of  the  initial  state  the  system  can  always  make  a  state 
transition  into  another  state,  and  the  system  is  guaranteed  to  find  itself  in  a  legitimate  state 
after  a  finite  number  of  transitions.  Our  computations  using  initially  contaminated  memory  can 
be  viewed  as  self-stabilizing  with  respect  to  the  state  of  shared  memory.  In  order  to  describe  our 
technical  contributions  we  must  now  review  the  state-of-the-art  of  the  algorithmics  of  Write-All. 

For the worst case on-line stop-failures without restarts, Kanellakis and Shvartsman [11] gave
an efficient (within a $\log^2 N$ factor) algorithm for Write-All (algorithm W) and other key problems
using an iterated Write-All paradigm. This paradigm was then employed independently by
Kedem et al. [12] and Shvartsman [20] to extend the results of [11] to arbitrary pram algorithms.
In addition, Kedem et al. [12] analyzed the expected behavior of several solutions to Write-All
using a random failure model. Shvartsman [20] presented a deterministic optimal O(N) work
execution of pram algorithms subject to worst case failures by exploiting parallel slackness
with $P \le N/\log^2 N$. A simple randomized Write-All algorithm that can be used for simulating
arbitrary pram algorithms on an asynchronous PRAM is presented by Martel et al. in [18];
this simulation has very good expected performance when the adversary is off-line. Kedem
et al. [13] have shown an $\Omega(N \log N)$ lower bound on work, for any deterministic Write-All
solution.  In  addition,  they  have  shown  an  0(N deterministic  work  upper  bound  on 
Write-All.  Their  upper  bound  is  based  on  a  variation  of  algorithm  W,  and  it  has  been  shown 
by  Martel  [16]  that  the  same  upper  bound  applies  to  algorithm  W  [11]. 

For  the  worst  case  on-line  stop-failures  with  restarts  there  has  also  been  some  progress.  A 
parallel  model  where  processors  are  subject  to  failures  and  restarts  is  examined  by  Buss  et 
al.  in  [4].  This  framework  generalized  previous  models  of  robust  parallel  computations  and  in 
it Write-All has a subquadratic $O(N^{1.59})$ work solution. Martel et al. [17] presented several
randomized  solutions  for  list  ranking  and  sorting  that  have  very  efficient  expected  work  when 
the  scheduling  adversary  is  off-line.  An  efficient  randomized  solution  for  the  Write-All  problem 
was developed by Anderson and Woll in [1] for the asynchronous parallel model. They
also gave an existence proof for an algorithm achieving work $O(N^{1+\varepsilon})$ for any $\varepsilon > 0$. General
synchronous  pram  simulations  are  impossible  using  bounded  resources  on  asynchronous  prams 
because  of  the  impossibility  result  shown  by  Herlihy  [10].  However  the  algorithms  in  [1]  can  be 
used  with  the  restartable  fail-stop  model  defined  by  Buss  et  al.  [4]  (which  restricts  asynchrony). 
We  will  take  advantage  of  this  since  general  simulations  are  possible  in  that  model. 
Contributions:

We eliminate the assumption that any amount of clear initial memory is available for the
fail-stop and restartable fail-stop algorithms. We develop deterministic fault-tolerant algorithms
that  can  be  used  to  simulate  prams  using  contaminated  memory,  i.e.,  when  the  shared  memory 
not  containing  the  input  is  initially  in  an  arbitrary  and  possibly  illegal  state.  We  also  improve 
on  the  state-of-the-art  robust  prefix  sums  computations.  More  specifically: 

1. In the no-restart fail-stop parallel model, any N-processor pram algorithm that runs
in time $\tau$ can be deterministically simulated using O(N) contaminated memory on P
fail-stop processors with work $O(N + P \log^3 N/(\log\log N)^2 + \tau \cdot P \log^2 N/\log\log N)$ for
$1 \le P \le N$.

This  simulation  has  an  optimal  range  of  processors,  i.e.,  the  work  of  the  simulation  is 
asymptotically  equal  to  the  work  of  the  simulated  non-fault-tolerant  algorithm. 

2. In the restartable fail-stop model, any N-processor pram algorithm that runs in time
$\tau$ can be simulated using O(N) contaminated memory on P = N restartable fail-stop
processors with $S = O(\tau \cdot N^{1+\varepsilon})$.

3. For the parallel prefix computation it is possible to improve on the oblivious simulations
of non-fault-tolerant algorithms (e.g., the ones we get by using [12, 20] with conventional
algorithms).  In  order  to  compute  the  prefix  sums  of  N  values  using  N  processors,  at 
least  log  N/  log  log  N  parallel  steps  are  required  [2, 15],  and  the  known  algorithms  require 
at  least  log  N  steps.  Therefore  an  oblivious  simulation  of  a  known  prefix  algorithm  will 
require  simulating  at  least  log  N  steps.  We  improve  this  work  of  oblivious  deterministic 
simulation  by  a  factor  of  log  N  when  the  memory  is  clear,  and  by  a  factor  of  log  log  N 
when  the  memory  is  contaminated. 

In the rest of the paper, we present the model in Section 2 and contamination-tolerant
algorithms in Section 3, and we cover general simulations and algorithm transformations in Section 4.

2  Model  and  definitions 

The  basis  of  our  model  is  the  restartable  fail-stop  crcw  pram  that  is  discussed  and  justified 
by  Buss  et  al.  in  [4],  except  that  the  shared  memory  that  does  not  contain  the  input  is 
contaminated: 

1. There are P pram processors. Each has a unique processor identifier $\mathrm{PID} \in \{0, \ldots, P-1\}$.

2. Shared memory is accessible to all processors; each processor has a constant size private
memory. Each memory cell stores one word of size $O(\log \max\{N, P\})$.

3.  The  input  is  stored  in  N  cells  in  shared  memory. 

4.  The  shared  memory  not  containing  the  input  is  contaminated. 
To enable algorithm termination and sensible accounting of resources, the work of the
processors is structured using update cycles. Each cycle consists of reading a small number of
shared memory cells, performing a fixed time computation, and writing a small number of
shared memory cells. The number of reads and writes per cycle is fixed, but depends on the
instruction set of the PRAM. The fail-stop with restart failure model is defined as follows:

1.  A  failure  pattern  F  (i.e.,  failures  and  restarts)  is  determined  by  an  on-line  adversary,  that 
knows  everything  about  the  algorithm  and  is  unknown  to  the  algorithm. 

2.  Any  processor  may  fail  at  any  time  in  any  update  cycle,  and  it  may  later  restart,  provided: 

(i)  at  any  time  at  least  one  processor  is  executing  an  update  cycle  that  successfully 
completes; 

(ii)  single  bit  writes  are  atomic,  i.e.,  failures  can  occur  before  or  after  a  write  of  a  single 
bit. 

3. Failures do not affect the shared memory, but the failed processors lose their private
memory. Processors are restarted at their initial state with their PID as their only knowledge.

Condition  2(i)  makes  termination  possible.  Update  cycles  also  serve  as  units  of  accounting. 
They  do  not  constrain  the  instruction  set  of  the  PRAM,  however  the  processors  are  not  charged 
for  the  instructions  of  the  update  cycles  that  are  not  completed.  (In  the  absence  of  update  cycle 
accounting,  a  thrashing  adversary  can  force  quadratic  work  for  any  Write-All  solution  [4].) 

A  failure  pattern  F  is  specified  as  a  set  of  triples  <tag,  PID,  t  >  where  tag  is  either  failure 
for  a  processor  failure,  or  restart  for  a  restart,  PID  is  the  processor  identifier,  and  t  is  the  time 
when  the  processor  either  stops  or  restarts.  The  size  of  F  is  defined  as  the  cardinality  |F|. 

The complexity measure completed work generalizes the Parallel-time × Processors product:

Definition 2.1 Consider an algorithm with P initial processors that terminates in parallel-time
$\tau$ after completing its task on some input data I of size $|I| = N$, and in the presence
of any pattern F of failures and restarts of size $|F| \le M$. If $P_t(I,F) \le P$ is the number of
processors completing an update cycle at time t, and c is the time required to complete one
update cycle, then we define completed work as: $S = S_{N,M,P} = \max_{I,F} \{ \sum_{t \le \tau} P_t(I,F) \}$. □
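
As a rough illustration of Definition 2.1, the sketch below counts completed work for one fixed input and one fixed failure pattern. The trace representation (a list holding, for each time step, the set of PIDs that complete an update cycle) and the name `completed_work` are assumptions made for this example and are not part of the model in [4].

```python
from typing import List, Set


def completed_work(trace: List[Set[int]]) -> int:
    """Sum, over all time steps, the number of processors that complete an update cycle.

    Cycles that a processor starts but does not complete are simply absent
    from the trace, so they are not charged.
    """
    return sum(len(completed) for completed in trace)


# Hypothetical run with three processors: PID 2 fails during step 2
# and has restarted in time to complete a cycle at step 4.
trace = [{0, 1, 2}, {0, 1}, {0, 1}, {0, 1, 2}]
print(completed_work(trace))  # 10
```

With no failures and every processor completing a cycle at every step, the count reduces to the usual Parallel-time × Processors product.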

Remark  1  The  incomplete  work  cycles  are  not  counted  in  S.  When  the  restarts  do  not  occur, 
then  the  maximum  work  spent  in  the  incomplete  cycles  is  bounded  by  0(P),  since  there  can  be 
no more than P failures. Therefore, for the fail-stop no-restart model, using completed work S
yields  the  same  results  as  using  the  available  processor  steps  measure  in  [11]. 

We use the notation “Write-All(N, P, L)” to stand for an instance of fault-tolerant
Write-All that uses P processors and clear auxiliary memory of size L to initialize to 1 an array of
size N.

Definition 2.2 An algorithm that uses P processors to solve a Write-All problem of size N is
contamination-tolerant if it is a Write-All(N, P, 0) algorithm. □
3 Write-All algorithms

The Write-All algorithms and simulations based on the Write-All paradigm, e.g., [11, 12, 13, 20],
or the algorithms that can serve as Write-All solutions, e.g., the addition algorithm in [5] or the
maximum  finding  algorithm  in  [18],  invariably  assume  that  a  linear  portion  of  shared  memory 
is  either  cleared  or  is  initialized  to  known  values.  Starting  with  a  non-contaminated  portion  of 
memory,  such  algorithms  and  simulations  are  able  to  perform  their  computation  by  “using  up” 
the  clear  memory,  and  concurrently  or  subsequently  clearing  additional  segments  of  memory 
needed  for  future  iterations.  We  develop  an  efficient  Write-All  solution  that  requires  no  clear 
shared  memory. 


3.1  A  Bootstrap  procedure 

We formulate a bootstrap approach to the design of fault-tolerant Write-All algorithms, such
that the auxiliary memory is initially contaminated. The bootstrapping proceeds in stages:

In  stage  1  of  our  procedure,  all  P  processors  clear  an  initial  segment  of  N0  locations  in  the 
auxiliary  memory. 

At stage i of the procedure, we use P processors to clear $N_{i+1}$ memory locations with
the help of the $N_i$ memory locations that were cleared in stage $i-1$.

If $N_{i+1} > N_i$ and $N_0 \ge 1$, then this procedure will clear the required N memory locations in
at most N stages. Say $\tau$ is the final stage number, i.e., $N_\tau = N$.

Let $P_i$ be the number of active processors that initiate stage i, and define $N_{-1} = 0$. The
cost of such a procedure is $S_{boot} = \sum_{i=0}^{\tau} S_i(N_i, P_i, N_{i-1})$, where $S_i$ is the cost of the
Write-All($N_i$, $P_i$, $N_{i-1}$) algorithm used in stage i.

The efficiency of the resulting algorithm depends on the choices of the particular Write-All
solution(s) used in each stage and the parameters $N_i$.

One specific approach is to define a series of multipliers $G_0, G_1, \ldots, G_\tau$ such that
$N_i = \prod_{j=0}^{i} G_j$. The high level view of such an algorithm is given in Figure 1. The algorithm consists
of an initialization (lines 02-03) and a parallel loop (lines 04-09). We use a variation of this
scheme below.
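
To make the growth of the cleared region concrete, the sketch below lists the sizes $N_0, N_1, \ldots$ produced by the multiplier scheme. It assumes that every multiplier equals $\lceil \log_2 N \rceil$ (the choice analysed for algorithm Z in Section 3.2) and that the last stage is capped at N; the function name `stage_sizes` and the cap are assumptions of this example.

```python
import math
from typing import List


def stage_sizes(n: int) -> List[int]:
    """Return N_0, N_1, ... with N_i = G_0 * ... * G_i, every G_j = ceil(log2 n), capped at n."""
    g = max(2, math.ceil(math.log2(n)))  # multiplier; at least 2 so that the sizes grow
    sizes = [min(g, n)]
    while sizes[-1] < n:
        sizes.append(min(sizes[-1] * g, n))
    return sizes


print(stage_sizes(1 << 16))  # [16, 256, 4096, 65536]
```

Because each stage multiplies the cleared region by about log N, only about log N / log log N stages are needed, far fewer than the worst-case N stages allowed by the general procedure.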

We next use the bootstrap approach to construct and analyze contamination-tolerant
Write-All algorithms in the fail-stop and restartable fail-stop models.

3.2  Algorithm  Z  for  the  fail-stop  model 

We  use  algorithm  W  of  Kanellakis  and  Shvartsman  [11]  and  its  analysis  by  Martel  [16].  We  call 
algorithm  Z  the  algorithm  that  results  from  using  W  in  each  phase  of  the  bootstrap  procedure. 

We analyze algorithm Z for the following choice of parameters: we use $G_0 = \log N$ and
$G_i = \log N$ (for $i > 0$). In the initialization, all P processors traverse a list of size $G_0$
sequentially and clear it. Then, iteratively, the processors use algorithm W to clear increasingly
larger sections of memory using the auxiliary memory cleared in the previous iteration (Fig. 1,
lines 05-07).

01 forall processors PID=0..P-1 parbegin  -- Use P processors to clear N memory
02   Clear the initial block of N_0 = G_0 elements sequentially using P processors
03   i := 0  -- Iteration counter
04   while N_i < N do
05     Use a Write-All solution with data structures of size N_i
06     and G_{i+1} elements at the leaves
07     to clear memory of size N_{i+1} = N_i · G_{i+1}
08     i := i + 1
09   od
10 parend

Figure 1: A high level view of the bootstrap algorithm.

Algorithm W is a fail-stop (no restart) Write-All solution. It uses two full binary trees
(represented  as  heaps  in  memory)  and  it  consists  of  a  loop  in  which  the  active  processors 
synchronously  iterate  through  the  following  phases: 

W1: enumerate the processors in a bottom-up traversal of the processor tree;

W2:  allocate  the  processors  in  a  divide-and-conquer  top-down  traversal  of  the  progress  tree; 

W3:  work  at  the  leaves;  and 

W4:  evaluate  progress  in  a  bottom-up  traversal  of  the  progress  tree. 

To  avoid  a  complete  restatement,  the  reader  is  urged  to  refer  to  [11].  Martel  showed  the 
following  upper  bound  for  algorithm  W: 

Theorem 3.1 [16] Algorithm W with P processors, the progress tree with H leaves ($P \le H$)
and $2H - 1$ total nodes all initialized to zero, and G array elements at each leaf, has the work
of $S = O((H + P \log H/\log\log H) \cdot (\log P + \log H + G))$ for any pattern of stop-failures.

Note that the above result and algorithm W can also be used when P > H. As described in
[4], when there are P processors and the progress tree has H < P leaves, it is sufficient
for each processor to take its PID modulo H to assure a uniform initial assignment of processors
and to preserve the result.
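
The modulo assignment can be pictured with a few lines of code. This is only a sketch of the initial placement; the helper name `initial_leaf` and the sample values of P and H are assumptions of this example.

```python
from collections import Counter


def initial_leaf(pid: int, h: int) -> int:
    """Leaf of the progress tree at which processor `pid` starts when P > H."""
    return pid % h


p, h = 10, 4
load = Counter(initial_leaf(pid, h) for pid in range(p))
print(sorted(load.items()))  # [(0, 3), (1, 3), (2, 2), (3, 2)]
```

Leaf loads differ by at most one, which is the sense in which the assignment is uniform.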

Algorithm W stores its binary trees as linear arrays interpreted as heaps. Therefore the
structure of the trees is unaffected by the state of the memory, because the heaps are implicit.
We next observe that the enumeration of the processors in phase W1 of algorithm W can
be done in a bottom-up traversal of a contaminated processor tree. The pseudocode for this
algorithm is given in Figure 2. We call it algorithm $Z_{enum}$. The surviving processors enumerate
themselves using a standard logarithmic time algorithm based on addition. The contaminated
memory cells are distinguished from the cells that contain valid values via the use of a single
bit associated with each cell (a so called “deadman flag”). When a processor arrives at a node,
it clears the bit associated with its sibling, then it sets its own bit (lines 16-17). Only cells that
have valid values written in them by active processors will have the bit set. The enumeration
itself is as in phase W1.

forall processors PID = 0..P-1 parbegin
  shared integer array c[1..2N-1];      -- processor counts
  shared bit array alive[1..2N-1];      -- alive/dead markers
  private integer pn;                   -- enumerated processor number
  private integer j1, j2,               -- left/right siblings indices
                  t;                    -- predecessor index of j1 and j2
  j1 := PID + (N - 1);                  -- heap-leaf init
  pn := 1;                              -- assume this processor is no. 1
  c[j1] := 1;                           -- a processor is counted once in this step
  for 1..log(P) do                      -- traverse the tree from leaf to root
    t := j1 div 2;                      -- parent of j1 and j2
    if 2 * t = j1
      then j2 := j1 + 1                 -- j1 came from left
      else j2 := j1 - 1                 -- j1 came from right
    fi;
    alive[j2] := 0;                     -- mark siblings dead
    alive[j1] := 1;                     -- mark self alive
    if alive[j2] = 1                    -- both sub-trees have active processors?
      then c[t] := c[j1] + c[j2];       -- both branches are active
        if j1 > j2                      -- j1 came from right, update processor number
          then pn := pn + c[j2]
        fi
      else c[t] := c[j1]                -- all siblings failed
    fi;
    j1 := t                             -- advance up the heap
  od
parend

Figure 2: Contamination robust processor enumeration $Z_{enum}$.
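
The deadman-flag idea can be exercised with a small sequential simulation of the synchronous steps. This is a sketch under simplifying assumptions, not the algorithm as analysed in the paper: P is a power of two, failures happen only before the enumeration starts (the set `live` is fixed), the heap is 1-based with leaf P + PID, and the random "contamination" and all names are made up for the example.

```python
import random
from typing import Dict, List, Set


def z_enum(p: int, live: Set[int], seed: int = 0) -> Dict[int, int]:
    """Enumerate the live processors over contaminated heaps; p must be a power of two."""
    rng = random.Random(seed)
    # Contaminated shared memory: arbitrary counts and arbitrary flag bits (index 0 unused).
    c: List[int] = [rng.randrange(1000) for _ in range(2 * p)]
    alive: List[int] = [rng.randrange(2) for _ in range(2 * p)]

    node = {pid: p + pid for pid in live}  # heap leaf of each live processor
    pn = {pid: 1 for pid in live}          # each processor assumes it is no. 1
    for pid in live:
        c[node[pid]] = 1                   # a processor is counted once

    for _ in range(p.bit_length() - 1):    # log2(p) synchronous rounds
        sibling = {pid: node[pid] ^ 1 for pid in live}
        for pid in live:                   # step 1: everyone clears the sibling's flag
            alive[sibling[pid]] = 0
        for pid in live:                   # step 2: everyone sets its own flag
            alive[node[pid]] = 1
        updates: Dict[int, int] = {}
        for pid in live:                   # step 3: read flags and counts, then combine
            j1, j2 = node[pid], sibling[pid]
            if alive[j2] == 1:             # the sibling subtree has live processors
                updates[j1 // 2] = c[j1] + c[j2]
                if j1 > j2:                # came from the right: skip the left subtree
                    pn[pid] += c[j2]
            else:                          # flag stayed cleared: ignore the garbage in c[j2]
                updates[j1 // 2] = c[j1]
        for parent, count in updates.items():
            c[parent] = count
        node = {pid: node[pid] // 2 for pid in live}
    return pn


print(sorted(z_enum(8, {1, 2, 5, 7}).items()))  # [(1, 1), (2, 2), (5, 3), (7, 4)]
```

The result does not depend on the seed, because a count is read from a sibling cell only when that cell's flag was set during the current round by a live processor.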

Theorem 3.2 Algorithm Z is a contamination-tolerant Write-All(N, P, 0) algorithm that for
any pattern of fail-stop errors has $S = O(N + P \log^3 N/(\log\log N)^2)$ for $1 \le P \le N$.

Proof: We first evaluate and then total the work of the algorithm during each of the finite
number of stages of its execution. In each use of algorithm W, we will have $G = \log N$ as the
number of memory locations associated with each leaf of the progress tree, and we will apply
Theorem 3.1 with different instantiations of H to evaluate the upper bound of work.

Stage 0: Enumerate processors using $Z_{enum}$, then sequentially clear $\log N$ memory locations using all
surviving processors. The work using the initial $P_0 \le P$ processors is: $W_0 = P_0 \cdot \log P + P_0 \cdot \log N$.

Stage 1: $P_1 \le P_0 \le P$. Using the instance of Theorem 3.1 where $H = \log N$, the work is:

$W_1 = (\log N + P_1 \log\log N/\log\log\log N) \cdot (\log P_1 + \log N + \log\log N)$.
Stage i: $P_i \le P_{i-1} \le N$. Using the instance where $H = \log^i N$:

$W_i = (\log^i N + P_i \cdot i \log\log N/(\log i + \log\log\log N)) \cdot (\log P_i + \log N + i \log\log N)$.

The final stage $\tau$ is when $\log^\tau N = N/\log N$, i.e., $\tau = \log N/\log\log N - 1$.

Totalling the work in all phases yields:

$$S = \sum_{i=0}^{\tau} W_i = W_0 + \sum_{i=1}^{\tau} \left(\log^i N + P_i \cdot \frac{i \log\log N}{\log i + \log\log\log N}\right) (\log P_i + \log N + i \log\log N)$$

Simplifying the sum results in $S = O(N + P \log^3 N/(\log\log N)^2)$. □
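
To get a feel for how the stage bounds add up, the following sketch evaluates the sum numerically and compares it with the simplified expression. It is an illustration only: it drops the constants hidden by the O-notation, takes every $P_i$ equal to P (no failures reduce the processor count), uses base-2 logarithms, and assumes N > 4; the function names are made up for this example.

```python
import math


def stage_work(h: float, p: float, g: float) -> float:
    """Theorem 3.1 bound with constants dropped: (H + P log H / log log H)(log P + log H + G)."""
    log_h = math.log2(h)
    return (h + p * log_h / math.log2(log_h)) * (math.log2(p) + log_h + g)


def total_work(n: int, p: int) -> float:
    """Stage 0 plus the stages with H = log N, log^2 N, ..., up to N / log N."""
    log_n = math.log2(n)
    total = p * math.log2(p) + p * log_n  # stage 0: enumerate, then clear log N cells
    h = log_n
    while h <= n / log_n:
        total += stage_work(h, p, log_n)  # G = log N elements at each leaf
        h *= log_n
    return total


def closed_form(n: int, p: int) -> float:
    """N + P log^3 N / (log log N)^2, again without constants."""
    log_n = math.log2(n)
    return n + p * log_n ** 3 / math.log2(log_n) ** 2


n = 1 << 16
for p in (1, 256, n):
    print(p, total_work(n, p) / closed_form(n, p))
```

For N = 2^16 the printed ratios stay within a small constant across these values of P, which is what the simplified bound predicts.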


This  approach  has  the  following  range  of  optimality: 


Theorem 3.3 Algorithm Z is a contamination-tolerant Write-All($N$, $N(\log\log N)^2/\log^3 N$, 0)
algorithm with S = O(N) for any pattern of fail-stop errors.


3.3 Algorithm $Z_r$ for the restartable fail-stop model

Algorithm $Z_r$ is similar to algorithm Z, except that in each stage we will be utilizing a restartable
Write-All algorithm. (Algorithm W is not suitable when restarts are allowed; see [4].)
Other parameters of the bootstrap procedure are the same as for the fail-stop case.

In this analysis, we will be using an algorithm that was described and characterized with
the following result by Anderson and Woll:

Theorem 3.4 [1] There exists a Write-All(H, H, H) solution with H processors that has work
$O(H^{1+\varepsilon})$ for every $\varepsilon > 0$.

This is an existential result, and we call this algorithm AW. The best known constructed
deterministic algorithm has $\varepsilon = \log_2 3 - 1 < 0.59$, as was shown by Buss et al. [4] (algorithm
X, which can also be used with the bootstrap). Note that algorithm AW was developed for the
asynchronous  model,  but  it  can  be  used  in  the  restartable  fail-stop  model  as  well.  The  work  of 
the  algorithm  in  the  asynchronous  model  is  the  same  as  its  completed  work  in  the  restartable 
fail-stop  model. 

Theorem 3.5 Algorithm $Z_r$ is a contamination-tolerant Write-All(N, N, 0) algorithm that for
any pattern of fail-stop errors has $S = O(N^{1+\varepsilon})$ for any $\varepsilon > 0$.

Proof: We first note that there exists a Write-All(H, P, H) solution with $P \ge H$ processors
that has work $O(P^{1+\varepsilon})$ for every $\varepsilon > 0$. We use algorithm AW, except all processors use their
PIDs modulo H. The worst case work is achieved when up to $\lceil P/H \rceil$ processors that have the
same PID modulo H operate synchronously as a single processor. The work of the algorithm in
this case is: $S = \lceil P/H \rceil \cdot O(H^{1+\varepsilon}) = O(P \cdot H^{\varepsilon}) = O(P^{1+\varepsilon})$. Using this algorithm at each stage of the bootstrap
procedure, and evaluating the total work as in Theorem 3.2, yields the desired result.

We evaluate and then sum the work of the algorithm during each of the finite number of
stages of its execution. In each stage $i \ge 1$ of algorithm $Z_r$, we will use algorithm AW $\log N$
times to clear $\log^{i+1} N$ memory locations. In each instance of use of Theorem 3.4, we will use
$\delta > 0$ as the exponent, such that $\delta = \varepsilon/2$. This is done to simplify the final sum using the
property that $\log N = O(N^{\delta})$ for any $\delta > 0$. We also use P = N for clarity.

Stage 0: All processors linearly initialize the segment of shared memory of length $\log N$.
The work is: $W_0 = P \cdot \log N$.

Stage 1: The algorithm is applied $\log N$ times to clear a segment of shared memory of size
$\log^2 N$. Using the instance where $H = \log N$, the work is: $W_1 = (P \log^{\delta} N) \cdot \log N$.

Stage i: Using the instance where $H = \log^i N$: $W_i = (P (\log^i N)^{\delta}) \cdot \log N = (P \log^{i\delta} N) \cdot \log N$.

Final stage $\tau$, where $\log^\tau N = N/\log N$, i.e., $\tau = \log N/\log\log N - 1$. Using the instance where
$H = \log^\tau N = N/\log N$, the work is: $W_\tau = (P (\log^\tau N)^{\delta}) \cdot \log N = (P (N/\log N)^{\delta}) \cdot \log N =
P \cdot N^{\delta} \log^{1-\delta} N$.

$$S = \sum_{i=0}^{\tau} W_i = W_0 + \sum_{i=1}^{\tau} (P \log^{i\delta} N) \cdot \log N = O(N^{1+\delta} \log^{1-\delta} N) = O(N^{1+\delta} \log N) = O(N^{1+\varepsilon}).$$
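
The simplification of this sum can be spelled out as follows. This is a sketch of the standard geometric-series argument, assuming P = N, $\delta = \varepsilon/2$, and N large enough that $\log^{\delta} N \ge 2$ (so that the series is dominated by twice its last term).

```latex
\begin{align*}
\sum_{i=1}^{\tau} (P \log^{i\delta} N)\cdot \log N
  &= P \log N \sum_{i=1}^{\tau} (\log^{\delta} N)^{i}
   \le 2\, P \log N \cdot (\log^{\delta} N)^{\tau} \\
  &= 2\, P \log N \cdot (N/\log N)^{\delta}
   = O(N^{1+\delta} \log^{1-\delta} N), \\
N^{1+\delta} \log^{1-\delta} N
  &\le N^{1+\delta} \log N = O(N^{1+2\delta}) = O(N^{1+\varepsilon}).
\end{align*}
```

The stage 0 term $W_0 = P \log N$ is of lower order and is absorbed by the same bound.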


4  Algorithm  simulations  and  algorithm  transformations 

4.1  Oblivious  simulations 

Using general simulation techniques [12, 20], if $S_w(N, P)$ is the efficiency of solving a Write-All
instance of size N using P processors, and if a linear amount of clear memory is available, then
a single N-processor pram step can be deterministically simulated using P fail-stop processors
and work $S_w(N, P)$. Thus if the Parallel-time × Processors of an original N-processor algorithm
is $\tau \cdot N$, then the work S of the fault-tolerant version of the algorithm will be $O(\tau \cdot S_w(N, P))$.

For the setting with initially contaminated shared memory, using algorithms Z and $Z_r$ with
the simulation techniques [12, 20], we obtain the following results:


Theorem 4.1 Any N-processor, $\tau$ parallel time PRAM algorithm can be simulated using O(N)
contaminated memory and P fail-stop crcw processors with $S = O(P \log^3 N/(\log\log N)^2 + \tau \cdot
N + \tau \cdot P \log^2 N/\log\log N)$ for $1 \le P \le N$.

This  simulation  has  optimal  ranges: 

Corollary 4.2 Any N-processor, $\tau$ parallel time pram algorithm can be simulated using O(N)
contaminated memory and P fail-stop crcw processors with $S = O(\tau \cdot N)$ when:

(1) $1 \le P \le N(\log\log N)^2/\log^3 N$, or

(2) $1 \le P \le N \log\log N/\log^2 N$ and $\tau \ge \log N/\log\log N$.
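
Both ranges follow by bounding each term of the work expression in Theorem 4.1 by $\tau \cdot N$. The short derivation below is a sketch that uses only $\tau \ge 1$, $\log\log N \le \log N$, and the stated bounds on P.

```latex
\begin{align*}
\text{(1) } P \le \frac{N(\log\log N)^2}{\log^3 N}:\quad
  & \frac{P\log^3 N}{(\log\log N)^2} \le N \le \tau N, \qquad
    \tau\cdot\frac{P\log^2 N}{\log\log N}
      \le \tau N\cdot\frac{\log\log N}{\log N} \le \tau N. \\
\text{(2) } P \le \frac{N\log\log N}{\log^2 N},\ \tau \ge \frac{\log N}{\log\log N}:\quad
  & \frac{P\log^3 N}{(\log\log N)^2} \le \frac{N\log N}{\log\log N} \le \tau N, \qquad
    \tau\cdot\frac{P\log^2 N}{\log\log N} \le \tau N.
\end{align*}
```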
In the restartable fail-stop model we get:

Theorem 4.3 Any N-processor, $\tau$ parallel time pram algorithm can be simulated using O(N)
contaminated memory and N restartable fail-stop crcw processors with $S = O((1 + \tau) \cdot N^{1+\varepsilon})$.

Remark 2 Buss et al. [4] define an amortized complexity measure of overhead ratio $\sigma$ that
measures the computational overhead of an algorithm relative to the necessary work and the
number of failures that are encountered. The simulation in the restartable fail-stop model has
overhead ratio per PRAM step of $\sigma = N^{\varepsilon}$. This overhead ratio can be made polylogarithmic by
interleaving algorithm $Z_r$ with algorithm V as presented in [4].


4.2  Improving  oblivious  simulations 

In addition to serving as the basis for oblivious simulations, any solution for the Write-All
problem can also be readily used as a building block for custom transformations of efficient
parallel algorithms into robust ones [11]. Custom transformations are interesting because in
some cases it is possible to improve on the work of the naive oblivious simulation. These
improvements are most significant for fast algorithms when the full range of processors is used,
i.e., when N processors are used to simulate N processors, because in this case the parallel
slack cannot be taken advantage of. For example, in the models with clear initial memory, a factor of
log N / log log N was saved in the pointer doubling simulations [11], and using randomization
and off-line adversaries, improvements can be obtained in the expected work of other algorithms
[17, 18].

We next show how to obtain deterministic savings in work for the prefix sums algorithm that
occurs in solutions of several important problems [3]. Efficient parallel algorithms and circuits
for computing prefix sums were given by Ladner and Fischer in [14], where the prefix problem
is defined as follows: Given an associative operation ⊕ on a domain V, and x_1, ..., x_n ∈ V,
compute, for each k (1 ≤ k ≤ n), the sum ⊕_{i=1}^{k} x_i.
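
For concreteness, the prefix problem can be stated in a few lines of sequential Python. This is only a failure-free illustration of the definition, not the parallel algorithm of [14]; the function name `prefix_sums` is a label chosen for this sketch.

```python
from itertools import accumulate
from operator import add


def prefix_sums(xs, op=add):
    """Return [x1, x1 (+) x2, ..., x1 (+) ... (+) xn] for an associative op."""
    return list(accumulate(xs, op))


print(prefix_sums([3, 1, 4, 1, 5]))          # [3, 4, 8, 9, 14]
print(prefix_sums(["a", "b", "c"]))          # ['a', 'ab', 'abc']
print(prefix_sums([3, 1, 4, 1, 5], op=max))  # [3, 3, 4, 4, 5]
```

String concatenation is associative but not commutative, which is why it is a useful test case: only the order-preserving grouping may change.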

Prefix sums can be computed robustly by using a naive simulation of a standard logarithmic
time algorithm. When using P = N processors, the work of such a simulation will be O(S_W · log N).

Prior  to  dealing  with  prefix  sums,  we  make  a  simple  observation  that  improves  on  another 
general  simulation.  It  follows  from  the  fact  that  since  algorithms  W  and  AW,  by  their  definition 
implement  tree  traversals,  they  can  be  used  to  implement  an  associative  operation  on  N  values: 

Theorem 4.4 Given an associative operation ⊕, and an array x[1..N], then ⊕_{i=1}^{N} x[i] can be
computed using N fail-stop processors at a cost of a single application of algorithm Z (or ZT).

This saves a full log N factor over oblivious simulations. We extend Theorem 4.4 and show a
robust prefix sum algorithm whose work complexity is O(S_W). In the no-restart fail-stop model
we have the following result:
Lemma 4.5 Parallel prefix for N values can be computed using N non-restartable fail-stop
processors and O(N) clear memory with S = O(N log^2 N / log log N).

Proof: The prefix summation algorithm that we use as the basis is an iterative
version of the recursive algorithm of [14]. The algorithm consists of two stages: (1) a
binary summation tree is computed, and (2) each prefix sum is computed from the summation
tree obtained in the first stage. Each prefix sum requires no more than a logarithmic number of
additions.
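
As an illustration of the first stage only, the summation tree can be built sequentially as follows. This is a failure-free sketch: it does not model algorithm W, its progress tree, or processor failures. The heap layout (root at index 1, leaves at indices N..2N-1), the restriction to N a power of two, and the name `build_summation_tree` are assumptions made for the sketch.

```python
from operator import add


def build_summation_tree(x, op=add):
    """Heap-ordered summation tree: tree[1] is the root, tree[n..2n-1] the leaves."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes N is a power of two")
    tree = [None] * (2 * n)        # tree[0] is unused
    tree[n:] = x                   # leaves hold the input values in order
    for j in range(n - 1, 0, -1):  # internal nodes, bottom up
        tree[j] = op(tree[2 * j], tree[2 * j + 1])
    return tree


tree = build_summation_tree([1, 2, 3, 4])
print(tree[1:])  # [10, 3, 7, 1, 2, 3, 4]
```

Node j stores the sum of the leaves below it, so the root holds the sum of all N values and each internal node costs one application of the operation.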

Each of the two stages can be performed in logarithmic time in parallel by up to N
processors. To produce the robust version of the above algorithm, we implement the above stages
using the controls of algorithm W with appropriate modifications as follows:

1.  In  the  first  stage,  a  binary  summation  tree  is  computed  in  bottom  up  traversals  at  the 
same  time  when  the  progress  tree  of  algorithm  W  is  being  updated.  This  modification  to 
the  algorithm  does  not  affect  its  asymptotic  complexity. 

2.  In  the  second  and  final  stage,  the  work  phase  of  algorithm  W  is  modified  to  include  the 
logarithmic  time  summation  operations  using  the  summation  tree  as  input  (as  in  Theorem 
4.4). 

This stage is shown in Figure 3. In the code, ⟨⟨i⟩⟩ is a binary string representing the value
i in binary, where the most significant bit is bit number 0, and ⟨⟨i⟩⟩_h is the true/false value of
the h-th most significant bit of ⟨⟨i⟩⟩.

The loop in lines 08-15 is the top-down traversal of the summation tree. In lines 10-14
the appropriate subtree sum is added (line 11) at depth h only if the corresponding bit
value of the processor PID is true.

Therefore  the  work  to  compute  prefix  sums  is  the  same  as  the  worst  case  work  of  algorithm  W. 

□ 


Thus  we  have  realized  a  multiplicative  factor  of  log  N  savings  over  the  oblivious  simulation 
when  the  memory  is  clear. 

Note that because of the lower bounds shown by Beame and Hastad [2] and Li and Yesha
[15], at least log N / log log N parallel time and at least N log N / log log N work will be required
by P = N processors to compute the prefix sums in the absence of failures. Therefore the
multiplicative overhead in work of our parallel prefix algorithm is only log N when using algorithm
W in the fail-stop model.

Using  Lemma  4.5  we  obtain  the  following  result  when  the  memory  is  contaminated: 

Theorem 4.6 Parallel prefix for N values can be computed using N fail-stop processors and
O(N) contaminated memory with S = O(N log^3 N/(log log N)^2).

Note that using N processors to simulate a logarithmic-time parallel prefix algorithm obliviously
would require work S = O(N log^3 N / log log N) (Theorem 4.1), and so the custom algorithm
saves a log log N factor relative to the oblivious simulation.
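
The comparison can be spelled out by substituting P = N and τ = Θ(log N) into the bound of Theorem 4.1, read here as S = O(P log^3 N/(log log N)^2 + τ N + τ P log^2 N/ log log N). The subscripted names below are labels introduced only for this worked step.

```latex
\begin{align*}
S_{\mathrm{oblivious}}
  &= O\!\left(\frac{N\log^3 N}{(\log\log N)^2} + N\log N
      + \frac{N\log^3 N}{\log\log N}\right)
   = O\!\left(\frac{N\log^3 N}{\log\log N}\right), \\
\frac{S_{\mathrm{oblivious}}}{S_{\mathrm{custom}}}
  &= \frac{N\log^3 N/\log\log N}{N\log^3 N/(\log\log N)^2}
   = \log\log N .
\end{align*}
```

The third term dominates the oblivious bound, and dividing by the bound of Theorem 4.6 leaves the log log N factor.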






01  forall processors PID = 0..N parbegin
02    shared integer array sum[1..2N-1];        -- summation tree
03    shared integer array prefix[1..N];        -- prefix sums
04    private integer j, j1, j2,                -- current/left/right indices
05                    h;                        -- depth in the summation tree
06    j := 1; h := 0;                           -- begin at the root, and at depth 0
07    prefix[PID] := 0;                         -- initialize the sum
08    while h ≠ log N do                        -- traverse from root to leaf
09      h := h + 1; j1 := 2*j; j2 := j1 + 1;    -- left/right indices at a new depth
10      if ⟨⟨PID⟩⟩_h                             -- Is the sub-sum at this level included?
11      then prefix[PID] := prefix[PID] + sum[j1];  -- add the left sub-sum
12           j := j2                            -- go down to the right
13      else j := j1                            -- go down to the left
14      fi;
15    od
16  parend


Figure  3:  Second  stage  of  contamination-tolerant  prefix  computation. 
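
The traversal in Figure 3 can be simulated sequentially for one processor at a time. The sketch below is failure-free and rests on several assumptions that are not in the original: N is a power of two, the tree is heap-ordered with leaves at indices N..2N-1, the bits tested at depths 1..log N are the log N low-order bits of PID taken most significant first, and the case PID = N (all leaves) is handled by reading the root directly. The names `summation_tree` and `prefix_from_tree` are hypothetical.

```python
def summation_tree(x):
    """Heap-ordered tree of integer sums; tree[0] is unused, leaves at tree[n..2n-1]."""
    n = len(x)
    tree = [0] * n + list(x)
    for j in range(n - 1, 0, -1):
        tree[j] = tree[2 * j] + tree[2 * j + 1]
    return tree


def prefix_from_tree(tree, n, pid):
    """Sum of the first `pid` leaves (0 <= pid <= n) using one root-to-leaf walk."""
    if pid == n:                    # assumption: every leaf is included, so use the root
        return tree[1]
    depth = n.bit_length() - 1      # log2(N) when N is a power of two
    j, total = 1, 0                 # begin at the root with an empty sum
    for h in range(1, depth + 1):
        j1, j2 = 2 * j, 2 * j + 1   # left/right children at the new depth
        if (pid >> (depth - h)) & 1:
            total += tree[j1]       # the whole left subtree precedes the target
            j = j2                  # go down to the right
        else:
            j = j1                  # go down to the left
    return total


x = [1, 2, 3, 4]
tree = summation_tree(x)
print([prefix_from_tree(tree, len(x), k) for k in range(1, len(x) + 1)])  # [1, 3, 6, 10]
```

Each walk performs at most log N additions, which matches the logarithmic number of additions per prefix sum stated in the proof of Lemma 4.5.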